\section{Core Data Model}
\label{sec:datam}
As reported in Section~\ref{sec:sota}, the core data model of MSR4J,
depicted in \figurename~\ref{fig:cm}, takes into consideration the
models from~\cite{maclean13, squire13} and CVSAnaly2.
This model contains 6 types of node and 22 types of relationship.

\subsection{Committer Node}
A committer is a project participant who is either registered as such in the publicly
available project team listing, or whose identifier has been discovered in the project's SCM logs.
A Committer node is characterised by its unique identifier within the project 
(often identifying the commits of the participant in the SCM), name, email, 
and web page URL. Additionally, for technical reasons involving the ASF case study
and the underlying framework MSR4J is built upon, we included boolean
properties related to ASF membership and emeritus member status. 
This is discussed later in Section~\ref{sec:archi}.

\begin{figure}[!t]
\centering
\includegraphics[scale=0.55]{DataModel-v3}
\caption{Core Data Model of MSR4J}
\label{fig:cm}
\end{figure}


A Committer node is related to Commit nodes through instances of the \textit{PROPAGATE}
relationship, which carries no properties. The \textit{TOUCH} relationship connects a
committer to a file s/he has added, deleted, modified or replaced in some revision (held
by a Commit node) within the considered dataset. It has a property indicating the number
of revisions in which the committer has touched the file in some manner.
Any pair of committers who have touched a common file at least once is connected with the \textit{CO\_COMMITTER} relationship.
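
As an illustration only, the following Python sketch shows how the \textit{TOUCH} counts and the
\textit{CO\_COMMITTER} pairs could be derived from a list of commits. The record layout
(committer identifier, revision, touched paths), the sample data and the function names are
assumptions made for this example; they are not part of MSR4J.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical input: one (committer_id, revision, touched_paths) tuple per commit.
commits = [
    ("alice", "r1", ["src/A.java", "pom.xml"]),
    ("bob", "r2", ["src/A.java"]),
    ("alice", "r3", ["src/A.java"]),
    ("carol", "r4", ["README"]),
]


def touch_counts(commits):
    """Number of revisions in which each committer touched each file."""
    counts = Counter()
    for committer, _revision, paths in commits:
        for path in set(paths):  # a file counts once per revision
            counts[(committer, path)] += 1
    return counts


def co_committers(commits):
    """Unordered pairs of committers who share at least one touched file."""
    by_file = defaultdict(set)
    for committer, _revision, paths in commits:
        for path in paths:
            by_file[path].add(committer)
    pairs = set()
    for committers in by_file.values():
        pairs.update(combinations(sorted(committers), 2))
    return pairs
```

In this sketch each pair is stored once, in sorted order, which treats \textit{CO\_COMMITTER} as
undirected; whether the relationship is stored in one or both directions is left open here.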

A committer may hold, or have held, administrative roles in a project, which results in
the Committer node being connected to the Project node with instances of the \textit{HOLD\_ADMIN\_ROLE}
relationship. This relationship has two properties, indicating the start and (possibly)
finish dates of the administrative duty. Two committers working in the same project are connected
with the \textit{SAME\_PROJECT\_AS} relationship. Being registered in the same repository
implies being connected with that relationship, but the converse does not always hold,
since a project may have many distinct repositories.
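
The asymmetry between sharing a repository and sharing a project can be sketched as follows.
This is a minimal example under stated assumptions: project membership is derived here from the
repositories a committer is registered in, and the dictionaries and function names are hypothetical.

```python
from itertools import combinations

# Hypothetical data: repository -> project, and committer -> repositories.
repo_project = {"repo-core": "P", "repo-site": "P", "repo-other": "Q"}
registered_in = {
    "alice": {"repo-core"},
    "bob": {"repo-site"},
    "carol": {"repo-core"},
    "dave": {"repo-other"},
}


def projects_of(committer):
    """Projects reached through the repositories of a committer."""
    return {repo_project[repo] for repo in registered_in[committer]}


def same_repository_pairs():
    """Pairs of committers registered in at least one common repository."""
    return {
        (a, b)
        for a, b in combinations(sorted(registered_in), 2)
        if registered_in[a] & registered_in[b]
    }


def same_project_pairs():
    """Pairs of committers that would be linked by SAME_PROJECT_AS."""
    return {
        (a, b)
        for a, b in combinations(sorted(registered_in), 2)
        if projects_of(a) & projects_of(b)
    }
```

With this sample data, every pair sharing a repository also shares a project, whereas
\texttt{alice} and \texttt{bob} share project \texttt{P} without sharing any repository.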

The \textit{REGISTERED\_IN} relationship connects a committer to a repository where s/he has
propagated at least one commit. It has two properties, indicating the dates of the first and
(possibly) last commit within the considered dataset. Note that a last commit date does
not necessarily indicate that the committer has left the project or unregistered from the
repository.
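
The two date properties could be computed in a single pass over the commits, as in the following
sketch. The tuple layout, the example URI and the function name are assumptions made for
illustration only.

```python
from datetime import date

# Hypothetical commit records: (committer_id, repository_uri, commit_date).
commit_log = [
    ("alice", "https://example.org/repo", date(2012, 3, 1)),
    ("alice", "https://example.org/repo", date(2013, 7, 15)),
    ("bob", "https://example.org/repo", date(2012, 5, 20)),
]


def registration_spans(commit_log):
    """Map (committer, repository) to its (first, last) commit dates."""
    spans = {}
    for committer, repo, day in commit_log:
        first, last = spans.get((committer, repo), (day, day))
        spans[(committer, repo)] = (min(first, day), max(last, day))
    return spans
```

A committer with a single commit in the dataset ends up with identical first and last dates,
which is consistent with the remark that the last date says nothing about departure.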

Whenever a committer has forked a repository, the corresponding nodes are connected with the
\textit{FORK} relationship, which has two properties: the revision number, matching a Commit node
identifier, and the URI of the new repository, matching the unique URI of a Repository node.

\subsection{Commit Node}
The Commit node is characterised by a unique revision number identifying the corresponding
commit in the repository, a date, a log message and, when this information is available,
the name of the actual author on behalf of whom the change is propagated.
The type of change committed to a File is represented by one of the relationships:
\textit{ADD, MODIFY, DELETE, RENAME, COPY} and \textit{REPLACE}. Each of them carries at most
two properties, indicating the number of lines added and/or removed.

The \textit{IMPACT} relationship connects a Commit node to the Repository where it took place.
When the commit tags or forks the repository, this relationship carries a property holding that information.

\subsection{Repository Node}
The Repository node models the document management system where artefacts of a project are
versioned. It is most probably a software repository (SCM), but it could also be any other
type of repository deemed relevant. It is characterised by a name, unique URI, type (e.g. CVS, SVN)
and whether it is at the root of a hierarchy of repositories. The \textit{PARENT} relationship is
hence meant to establish hierarchical relationships between repositories.

The \textit{ATTACHED\_TO} relationship connects a repository to the project(s) it is part of.
When a repository is forked, it is connected to the new repository with the \textit{FORKED\_INTO}
relationship. A repository connects to all Files it contains via the \textit{CONTAIN} relationship.

\subsection{Project Node}
The Project node models a software project. In this core model, it simply has a name and
website URL as properties. The  \textit{INVOLVE} relationship connects a project to all
its participants. The hierarchical relationship between projects is modelled by \textit{TOP\_LEVEL}.

\subsection{File Node}
The File node models files and directories in the repository they belong to.
Properties are the (unique and immutable) fully qualified path and whether it is a directory.
Since paths are immutable, the relationships \textit{REP\_BY, COP\_TO} and \textit{REN\_TO} help
identify file replacement, copying and renaming, respectively.
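
Because a renamed file is represented by a new File node, the history of a file can be
reconstructed by following \textit{REN\_TO} relationships. The sketch below illustrates the idea
on a plain dictionary; the sample paths and the function name are hypothetical, and it assumes
at most one outgoing \textit{REN\_TO} edge per path.

```python
# Hypothetical REN_TO edges: old path -> new path. Paths are immutable,
# so each rename points to a different File node.
ren_to = {
    "src/Util.java": "src/util/Util.java",
    "src/util/Util.java": "src/util/Utils.java",
}


def latest_path(path, ren_to):
    """Follow REN_TO edges from `path` to the most recent known name."""
    seen = {path}
    while path in ren_to:
        path = ren_to[path]
        if path in seen:  # guard against a cyclic chain in malformed data
            break
        seen.add(path)
    return path
```

The same traversal would apply to \textit{REP\_BY}, while \textit{COP\_TO} may fan out to
several targets and would need a set-valued mapping instead.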

The relationship \textit{FILE\_TYPE} connects a file to its type.

\subsection{FileType Node}
This node models the type of a file. It has 3 properties: the actual type (e.g. code, build, 
image, documentation, etc.), the file extension and the programming language if any.

We assume that the type of a file does not change; otherwise, the change is signalled in the
commits by a renaming or replacement action.

\subsection{Limitations of the Core Model}
\label{sec:limitcm}
This model does not take into consideration branches in a repository.
We consider this concept difficult to capture, and its interpretation rather ambiguous.
In contrast, tags and forks can be properly identified.

Not all features from software forges, as identified by Squire and Williams in~\cite{SquireW12},
are included in this model. In particular, there is no clear identification of artefacts issued
in mailing lists, bug trackers, task managers, forums, wikis, etc. However, Repository
and File, associated with FileType, can be used as meta concepts to represent them.

Since this model is easily extensible, we believe every implementation could add relevant
concepts to it, while reusing the existing ones as much as possible.


